gh-148706: Pure-Python pickle._Pickler can be significantly optimized #148707
Open
mjbommar wants to merge 9 commits into python:main from
Conversation
…/ _batch_setitems
The C accelerator has batch_list_exact (Modules/_pickle.c:3179) and
batch_dict_exact (Modules/_pickle.c:3455) that skip the generic
batched() + enumerate() overhead for exact list / dict instances under
binary protocols. The pure-Python _batch_appends and _batch_setitems
didn't have equivalents.
This adds:
_batch_appends_exact(obj) — slice-snapshot per batch; no itertools.batched generator, no enumerate() object.
_batch_setitems_exact(obj) — for dicts <= BATCHSIZE, a single direct iteration of obj.items(); for larger dicts, a materialized list with sliced batches.
Activated only when type(obj) is list / dict and self.bin is true.
Other types and proto 0 still use the generic paths unchanged.
Pure-Python dump numbers (best-of-9 median per bench):
list_of_ints_10k dump -3.1% load -2.6%
list_of_strs_1k dump -1.7% load -3.2%
dict_str_int_5k dump -0.2% load -3.5% (>BATCHSIZE)
deep_list dump -17.2% load -0.3%
nested_list_of_dicts dump -22.1% load -1.1%
Load deltas are unrelated micro-noise; the load path was not touched.
Correctness:
test_pickle 1060/1060 pass
test_pickletools 202/202 pass
test_copy 83/83 pass
test_copyreg 6/6 pass
test_importlib 1217/1217 pass
dill 0.4.1 29/30 pass (pre-existing 3.15a8 incompat)
cloudpickle 3.1.2 236/236 + 22 skipped + 2 xfailed (identical)
The test_evil_pickler_mutating_collection case motivated using a
per-batch slice snapshot for lists (so concurrent mutation doesn't
raise IndexError) and relying on dict's built-in size-change check
for dicts.
…ave()
Two follow-ups to the exact-container fast paths (94b53eb):
D) Inline framer.commit_frame() at the top of save(). The hot check
(current_frame is None, or tell() < _FRAME_SIZE_TARGET) runs on every
save() call; skipping the Python method dispatch when no commit is
needed removes a measurable per-call tax on long runs.
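A toy model of the inlined hot check, assuming the attribute names from pickle's _Framer (current_frame, _FRAME_SIZE_TARGET); this is illustrative, not the shipped diff:

```python
import io

_FRAME_SIZE_TARGET = 64 * 1024  # matches pickle._Framer._FRAME_SIZE_TARGET

def needs_commit(current_frame):
    """The check that runs on every save() call: commit only when a
    frame is open and has reached the target size."""
    return (current_frame is not None
            and current_frame.tell() >= _FRAME_SIZE_TARGET)

needs_commit(None)                       # no frame open: skip
small = io.BytesIO(); small.write(b"x" * 10)
needs_commit(small)                      # under target: skip
big = io.BytesIO(); big.write(b"x" * (64 * 1024))
needs_commit(big)                        # frame full: commit
```

Inlining this condition means the common skip case costs two attribute loads and a comparison instead of a full Python method call into the framer.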
E) Short-circuit the dispatch-table dict.get for the common atomic
types (str, int, NoneType, bool, float) by matching type(obj) with
direct `is` checks before falling through to `self.dispatch.get(t)`.
Placed after the memo and reducer_override checks so semantics for
repeated strings, subclasses, and custom reducers are unchanged.
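The semantics argument can be checked with a small model: `type(obj) is T` identity tests never match subclass instances, so subclasses keep falling through to the generic dispatch-table lookup exactly as before (function and return values here are a toy, not the real save()):

```python
def fast_path_kind(obj):
    """Toy mirror of the E short-circuit: exact-type identity checks
    before the dispatch-table dict.get."""
    t = type(obj)
    if t is str:
        return "str"
    if t is int:
        return "int"
    if t is type(None):
        return "none"
    if t is bool:
        return "bool"
    if t is float:
        return "float"
    return "generic"  # self.dispatch.get(t) path in the real save()

class MyInt(int):
    pass

fast_path_kind(42)        # direct save path
fast_path_kind(True)      # type(True) is bool, not int, so no ambiguity
fast_path_kind(MyInt(1))  # subclass: generic path, semantics unchanged
```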
Pure-Python dump numbers (best-of-9 median per bench, vs `main` with
no pickle patches):
list_of_ints_10k dump -11.7% load -3.9%
list_of_strs_1k dump -8.9% load -3.4%
dict_str_int_5k dump -9.9% load -4.4%
deep_list dump -24.6% load -2.9%
nested_list_of_dicts dump -28.5% load -2.8%
(Cumulative with the exact-container fast paths from 94b53eb.) The
load deltas are unrelated noise; the load path was not touched.
Correctness:
test_pickle 1060/1060 pass
test_pickletools 202/202 pass
test_copy 83/83 pass
test_copyreg 6/6 pass
test_importlib 1217/1217 pass
dill 0.4.1 29/30 pass (pre-existing 3.15a8 incompat)
cloudpickle 3.1.2 236/236 + 22 skipped + 2 xfailed (identical)
Investigations that did NOT ship:
B) Hoisting persistent_id / reducer_override hook checks to __init__
— precomputed bool + __dict__ probe is strictly more work than the
original self.persistent_id(obj) call, which hits the type-attribute
cache. Rejected twice on clean measurements.
C) Atomic-tuple memoize skip — semantically safe but changes byte-
exact pickle output, breaking test_pickle_to_2x's fixture assertion.
Deferred; would require updating DATA_SET2 / DATA_XRANGE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the pure-Python _Pickler fast-path investigation that produced commits 94b53eb and bb9d721. Mirrors the structure of Misc/marshal-perf-diary.md.
Covers five experiments (A-E) — what shipped (Exp 4 exact-container fast paths; D inlined commit_frame; E atomic-type is-short-circuit), what was rejected (B hook hoisting — Python's type-attribute cache beats manual short-circuits), and what was deferred (C atomic-tuple memoize skip — blocked by byte-exact pickle-output test fixture).
Includes the kernel-compile-contamination lesson and the methodology that caught it on the clean rerun.
Adds Misc/pickle-perf-data/ with the six canonical JSON artifacts plus the bench / profile scripts used throughout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(F1)
Two changes:
1) _batch_setitems_exact large-dict branch (`n > _BATCHSIZE`) used to
materialise `list(obj.items())` upfront, silently losing the
`RuntimeError("dictionary changed size during iteration")` raised
by dict_items when mutation happens during save() (e.g. from a
persistent_id hook). This is the exact-dict semantics regression
flagged in review. Fix: delegate to the generic `_batch_setitems`,
which iterates through `batched()` over the live items view and
preserves the detection. The small-dict `n <= _BATCHSIZE` fast path
already used direct `for k, v in items:` iteration which triggers
the same check.
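The difference between the live view and the materialised list is easy to reproduce standalone (a sketch of the local reproducer; helper name is made up):

```python
def mutating_consumer(pairs, target):
    """Iterate key/value pairs, inserting a key into `target` on the
    first one (models a hostile persistent_id hook firing mid-save)."""
    first = True
    for k, v in pairs:
        if first:
            target["__new_key__"] = None
            first = False

d1 = {i: i for i in range(5)}
try:
    mutating_consumer(d1.items(), d1)    # live dict_items view
    live_raised = False
except RuntimeError:
    # "dictionary changed size during iteration" — the check we preserve
    live_raised = True

d2 = {i: i for i in range(5)}
mutating_consumer(list(d2.items()), d2)  # snapshot: mutation goes undetected
```

The live view raises; the upfront `list(obj.items())` silently pickles the stale snapshot, which is exactly the regression the generic `_batch_setitems` delegation avoids.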
2) Reorder save()'s fast paths to match the dispatch order in
Modules/_pickle.c::save(): atomic types (str / int / None / bool /
float) are tested *before* the memo.get()/reducer_override probe.
The C reference skips both for these types — str because save_str
handles memoisation inline, the others because they're never
memoized. Pure Python now does the same:
- str: inline memo.get check, then save_str on miss (preserves dedup)
- int / None / bool / float: direct save_*, no memo.get, no
reducer_override probe
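The str dedup guarantee can be modelled in a few lines (a toy memo keyed by id(), not the real _Pickler; opcode names replaced by tags):

```python
def save_with_memo(obj, memo, out):
    """Toy model of the inlined str memo check: first sighting saves
    the payload and records a memo index; repeats emit only a GET."""
    key = id(obj)
    hit = memo.get(key)
    if hit is not None:
        out.append(("GET", hit))     # memo hit: reference, no re-save
    else:
        memo[key] = len(memo)        # record index, then save payload
        out.append(("STR", obj))

memo, out = {}, []
s = "hello"
for _ in range(3):
    save_with_memo(s, memo, out)
# out == [("STR", "hello"), ("GET", 0), ("GET", 0)]
```

Inlining the `memo.get` before `save_str` keeps repeated-string deduplication byte-identical while still skipping the generic dispatch machinery.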
Pure-Python dump numbers vs clean `main` (best-of-9 median):
list_of_ints_10k dump -38.6% load -3.8%
list_of_strs_1k dump -17.2% load -3.6%
dict_str_int_5k dump -25.9% load -4.2%
deep_list dump -44.4% load -3.1%
nested_list_of_dicts dump -34.9% load -2.8%
Load deltas are unrelated noise. The semantic change (skip
reducer_override for atomic types) matches the C pickler's long-
established behaviour.
test_pickle 1060/1060, test_pickletools 202/202, test_copy 83/83,
cloudpickle 236/236 identical breakdown. Large-dict mutation detection
verified via a local reproducer (RuntimeError raised as expected).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
save_long for small non-negative ints did `BININT1 + pack("<B", obj)`
per call. Precompute the 256 two-byte opcode sequences once at module
import (about 50 KB) and use a tuple index instead. Eliminates a
_struct.pack call on every such save.
Pure-Python dump numbers vs the previous commit (F1):
list_of_ints_10k     dump -0.0%  (values 0-9999, only 256 hit the cache)
list_of_strs_1k      dump -1.4%
dict_str_int_5k      dump -1.4%
deep_list            dump -9.9%  ([i]*10 for i in 0..499 — every value hits the cache)
nested_list_of_dicts dump -1.5%
Cumulative vs clean `main`: -18% to -50% dump depending on workload.
test_pickle 1060/1060, cloudpickle 236/236.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
memoize() used to call self.put(idx) which returns an opcode bytestring
and then write it. For protocol >= 4 (the common case for any recent
user), put() just returns the single-byte MEMOIZE constant. Inline the
protocol dispatch directly into memoize() to skip the method call and
the temporary-bytes indirection.
Also caches `memo = self.memo` as a local to avoid two attribute loads
on the common path.
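Before/after shape of the change, as a hedged standalone sketch (class and put() are simplified stand-ins; the real memoize also asserts the object isn't already memoized):

```python
import pickle

class Sketch:
    def __init__(self, proto):
        self.proto = proto
        self.memo = {}
        self.out = bytearray()

    def memoize(self, obj):
        memo = self.memo                 # one attribute load, reused
        if self.proto >= 4:
            self.out += pickle.MEMOIZE   # inline: no put() call, no temp
        else:
            self.out += self.put(len(memo))
        memo[id(obj)] = len(memo)

    def put(self, idx):
        # kept for subclass override compatibility (simplified: the real
        # put() chooses between BINPUT / LONG_BINPUT / PUT)
        return pickle.LONG_BINPUT + idx.to_bytes(4, "little")
```

For protocol >= 4 the method call and the temporary bytes object both disappear; older protocols still route through put(), so subclasses overriding it are unaffected.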
Pure-Python dump numbers vs the previous commit (F4):
list_of_ints_10k dump +1.0% (ints not memoized; noise)
list_of_strs_1k dump -1.9%
dict_str_int_5k dump -1.2%
deep_list dump -1.3%
nested_list_of_dicts dump -1.6%
Cumulative vs clean `main`: -20% to -51% dump.
put() method kept for subclass override compatibility.
test_pickle 1060/1060.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `bytes` to the save() dispatch fast path, placed AFTER the four
non-memoized atomics (int / None / bool / float) so int-heavy workloads
don't pay an extra type-identity branch before hitting their direct
dispatch. bytes is memoized, so memo.get is inlined alongside the
save_bytes call the same way str is.
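The branch ordering and the bytearray fall-through can be modelled with a toy dispatcher (return values are tags, not real save_* calls):

```python
def fast_kind(obj):
    """Toy mirror of the F6 ordering: four non-memoized atomics first,
    then the memoized str/bytes checks, else the dispatch table."""
    t = type(obj)
    if t is int or t is type(None) or t is bool or t is float:
        return "atomic"          # direct save_*, no memo probe
    if t is str or t is bytes:
        return "memoized-fast"   # inline memo.get, then save_str/save_bytes
    return "generic"             # dispatch.get(t); bytearray lands here

fast_kind(b"abc")           # hits the bytes fast path
fast_kind(bytearray(b"a"))  # not type-identical to bytes: generic path
```

Int-heavy workloads exit on the first branch and never see the bytes test, which is why the ordering keeps the main benches flat.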
Dump deltas on a bytes-heavy workload:
list_of_short_bytes_5k dump -10.7%
list_of_medium_bytes_1k dump -11.7%
dict_bytes_to_int_2k dump -9.1%
list_of_bytearrays_1k   dump +0.4%  (bytearray isn't caught; hits dispatch table)
Main-bench impact (F2 -> F6v2): within ±0.7% across all benches.
bytearray specialisation is a separate question; rare in real payloads.
test_pickle 1060/1060.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expands Misc/pickle-perf-diary.md to cover the round-2 experiments:
F1 SHIPPED — atomic fast paths before memo (matches C pickler order)
F2 SHIPPED — inline MEMOIZE proto-4+ in memoize()
F3 REJECTED — frame byte counter (Python bookkeeping loses to
BytesIO.tell()'s C method call)
F4 SHIPPED — cache BININT1 opcode bytes for n in 0..255
F5 REJECTED — ASCII save_str (Python utf-8 encoder already fast
for pure-ASCII)
F6 SHIPPED — bytes in save() fast path
F7 DEFERRED — exact-set batching (needs set workload)
Plus the large-dict mutation-detection regression in Exp 4 that
review flagged, fixed as part of F1.
Updated "What we learned" with four new lessons distilled from the
failures (F3, F5) and the order-matching insight from F1. Added a
"What we validated this session" section citing full CPython
regression suite passing (48,928 tests), dill 29/30 (pre-existing
3.15a8 incompat), cloudpickle 243/243 identical to baseline, joblib
focused subset identical to baseline.
Misc/pickle-perf-data/ adds nine round-2 JSON artifacts plus the
bytes-specific bench script, and README.md documents which file
represents which branch state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Announces the pure-Python _Pickler speedup stack (Exp 4 + D + E + F1 + F2 + F4 + F6, seven commits preceding this one). Highlights the dill impact and flags the single semantic change (atomic types skip reducer_override to match the C reference implementation).
Contributor
I think the best way to share the data and experimental results would be to put them in a dedicated repository and link to it here. Additionally, I'd recommend splitting this PR into smaller, more focused PRs — one optimization per PR. Before proceeding, it would be a good idea to get feedback from @serhiy-storchaka.
Contributor
don't add json again and again
As discussed in #148706, packages like dill can be made ~20-40% faster by optimizing various paths in the pure-Python pickler that they subclass.
Note: I included a number of experiment logs and microbenchmark data for reviewers to examine; these should obviously be removed before merging, but I didn't see any documentation on an alternative for sharing this kind of info.
A few key points, most of which are provided in more detail in the Misc/pickle-perf-diary.md doc and Misc/pickle-perf-data/ results:
- attrs, dill, cloudpickle, joblib, etc. (3.15 already has some regressions)
Some representative benchmarks:
Assisted-by: gpt-5-4-xhigh
Assisted-by: opus-4-7